Skip to content

Stabilize integration tests and fail fast on persistent REST protocol errors#2082

Open
fressi-elastic wants to merge 18 commits intoelastic:masterfrom
fressi-elastic:improve-integration-tests-stability
Open

Stabilize integration tests and fail fast on persistent REST protocol errors#2082
fressi-elastic wants to merge 18 commits intoelastic:masterfrom
fressi-elastic:improve-integration-tests-stability

Conversation

@fressi-elastic
Copy link
Copy Markdown
Contributor

@fressi-elastic fressi-elastic commented Mar 30, 2026

  • REST wait: Stop burning the full retry window on repeated HTTP/HTTPS-style protocol errors; cap consecutive protocol errors and surface a clear setup error.
  • IT benchmark port: Free Elasticsearch cluster port before running Rally by killing the JVM after founding out the port is being held by rally benchmark cluster. It performs this operation before and after the test cases (it uses a fixture).
  • IT clusters: Probe port with retries; wait for yellow (or configured status) with client timeout aligned to ES wait; reuse metrics store when something already listens on 10200.
  • IT environment: Force an empty global gitconfig for the session so local ~/.gitconfig cannot skew git/network behavior in tests.
  • Python & tar: Gate TarFile.extractall's filter= argument to 3.12+ for runtime and typing.
  • Tests: Update and extend wait_for_rest_layer tests for the new protocol-error cap.

Add TestCluster.is_cluster_running_on_port() to detect our cluster by
cluster name. EsMetricsStore.start() reuses a running node on HTTP_PORT
and only waits for yellow health; otherwise behavior is unchanged.
Document that wait_for_cluster_health needs http_port set.

Made-with: Cursor
Added TestCluster.is_cluster_running_on_port() method to check if a cluster is active on a specified port. Updated EsMetricsStore.start() to utilize a running node on HTTP_PORT and only wait for yellow health, while maintaining existing behavior otherwise. Updated documentation for wait_for_cluster_health to require http_port setting.
Add isolate_git_global_config autouse session fixture that sets
GIT_CONFIG_GLOBAL to it/resources/gitconfig-empty for all integration
tests. shared_setup depends on it so isolation runs first.

Remove per-test GIT_CONFIG_GLOBAL handling from test_run_without_http_connection.

Made-with: Cursor
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to improve integration test stability by reducing flakiness from transient Elasticsearch startup errors and by isolating integration tests from developer-specific Git configuration.

Changes:

  • Retry transient-looking protocol connection errors in wait_for_rest_layer() before raising a scheme-mismatch SystemSetupError.
  • Reuse an already-running Elasticsearch “metrics store” cluster (on a fixed port) and add an explicit wait for cluster health.
  • Isolate integration tests from ~/.gitconfig by setting GIT_CONFIG_GLOBAL to an empty config file.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
esrally/client/factory.py Adjusts connection error handling to retry protocol errors during startup and improves debug logging.
tests/client/factory_test.py Updates unit test to match the new retry-then-fail behavior for protocol errors and avoids real sleeping.
it/conftest.py Reuses an existing metrics-store cluster on port 10200 and waits for yellow health.
it/__init__.py Adds helpers to detect an already-running cluster by port and to wait for cluster health.
it/basic_test.py Ensures integration test subprocesses ignore user git config by setting GIT_CONFIG_GLOBAL.
it/resources/gitconfig-empty Adds an empty gitconfig resource intended to be used for GIT_CONFIG_GLOBAL isolation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@fressi-elastic fressi-elastic marked this pull request as ready for review March 30, 2026 23:14
@fressi-elastic fressi-elastic added the bug Something's wrong label Mar 30, 2026
@fressi-elastic fressi-elastic added this to the 2.14.0 milestone Mar 30, 2026
@fressi-elastic fressi-elastic requested review from a team, Copilot and pquentin March 31, 2026 02:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Use request_timeout=timeout+30s so HTTP client limits stay above the
Elasticsearch cluster.health wait parameter. Express timeout in seconds
(float) for callers; metrics store IT uses defaults.

Made-with: Cursor
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Retry is_cluster_running_on_port with bounded attempts, request timeout,
  and debug logging on failure (integration test stability).
- Cap consecutive protocol-like ConnectionErrors in wait_for_rest_layer
  so scheme misconfiguration fails faster than full max_attempts.

Made-with: Cursor
@fressi-elastic fressi-elastic force-pushed the improve-integration-tests-stability branch from 1a6d4ee to 8e8d5a7 Compare March 31, 2026 08:34
TarFile.extractall(filter=...) is only available from Python 3.12 (PEP 706).
Pass the filter only when supported so decompression works on 3.10 and 3.11.

Made-with: Cursor
Replace **params dict with explicit Python 3.12+ branches so filter
is typed as a literal and mypy accepts TarFile.extractall.

Made-with: Cursor
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Elasticsearch returns HTTP 200 with timed_out:true when the server-side
wait window expires without reaching wait_for_status. Treat that as a
failed wait and raise AssertionError with status context.

Made-with: Cursor
@fressi-elastic fressi-elastic changed the title Improve integration tests stability Stabilize integration tests and fail fast on persistent REST protocol errors Mar 31, 2026
Document and annotate wait_for_rest_layer protocol-error handling,
_do_tar_decompress PEP 706 behavior, TestCluster probe/health helpers,
and session pytest fixtures (Generator return types).

Made-with: Cursor
Use an absolute it/ directory for GIT_CONFIG_GLOBAL so subprocesses find
gitconfig-empty regardless of cwd.

Expose cluster probing as probe_cluster_on_port (returns cluster name or None);
is_cluster_running_on_port compares to cfg. EsMetricsStore.start fails fast
when the port serves another cluster or a non-ES listener before install.

Remove a useless pylint suppression on configure_file_handler.

Made-with: Cursor
Explain in the docstring and inline comment that extractall(filter=)
is only available from 3.12 (PEP 706) and that the legacy path lacks
member-path filtering, so untrusted archives could escape target_directory.

Made-with: Cursor
Add helpers to clear Rally-owned Elasticsearch on the IT benchmark port (19200) and expose it via pytest fixtures that yield the port number.

stop_rally_provisioned_es_on_port returns None; ensure_benchmark_http_port_free must not assign that result back to http_port, which previously made fixtures yield None and broke --target-hosts.

Update distribution, sources, tracker, and related tests to use the fixture port consistently.

Made-with: Cursor
@fressi-elastic fressi-elastic enabled auto-merge (squash) April 1, 2026 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something's wrong

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants